Salaries SF Fiscal year¶

Welcome to a quick exercise for you to practice your pandas skills! We will be using the SF Salaries Dataset from Kaggle! Just follow along and complete the tasks outlined in bold below. The tasks will get harder and harder as you go along.

Import pandas as pd.

In [1]:
import pandas as pd
import numpy as np

Read Salaries.csv as a dataframe called sal.

In [2]:
df_Salaire = pd.read_csv('Salaries.csv') 

Check the head of the DataFrame.

In [3]:
df_Salaire.head()
Out[3]:
Id EmployeeName JobTitle BasePay OvertimePay OtherPay Benefits TotalPay TotalPayBenefits Year Notes Agency Status
0 1 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 167411.18 0.00 400184.25 NaN 567595.43 567595.43 2011 NaN San Francisco NaN
1 2 GARY JIMENEZ CAPTAIN III (POLICE DEPARTMENT) 155966.02 245131.88 137811.38 NaN 538909.28 538909.28 2011 NaN San Francisco NaN
2 3 ALBERT PARDINI CAPTAIN III (POLICE DEPARTMENT) 212739.13 106088.18 16452.60 NaN 335279.91 335279.91 2011 NaN San Francisco NaN
3 4 CHRISTOPHER CHONG WIRE ROPE CABLE MAINTENANCE MECHANIC 77916.00 56120.71 198306.90 NaN 332343.61 332343.61 2011 NaN San Francisco NaN
4 5 PATRICK GARDNER DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) 134401.60 9737.00 182234.59 NaN 326373.19 326373.19 2011 NaN San Francisco NaN

Use the .info() method to find out how many entries there are.

In [5]:
df_Salaire.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148654 entries, 0 to 148653
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Id                148654 non-null  int64  
 1   EmployeeName      148654 non-null  object 
 2   JobTitle          148654 non-null  object 
 3   BasePay           148045 non-null  float64
 4   OvertimePay       148650 non-null  float64
 5   OtherPay          148650 non-null  float64
 6   Benefits          112491 non-null  float64
 7   TotalPay          148654 non-null  float64
 8   TotalPayBenefits  148654 non-null  float64
 9   Year              148654 non-null  int64  
 10  Notes             0 non-null       float64
 11  Agency            148654 non-null  object 
 12  Status            0 non-null       float64
dtypes: float64(8), int64(2), object(3)
memory usage: 14.7+ MB

What is the average BasePay ?

In [13]:
df_Salaire.BasePay.mean()  # or df_Salaire['BasePay'].mean()
Out[13]:
66325.44884050643

What is the highest amount of OvertimePay in the dataset ?

In [14]:
df_Salaire.OvertimePay.max()  # or df_Salaire[OvertimePay].max()
Out[14]:
245131.88

What is the job title of JOSEPH DRISCOLL ? Note: Use all caps, otherwise you may get an answer that doesn't match up (there is also a lowercase Joseph Driscoll).

In [30]:
df_Salaire[df_Salaire['EmployeeName'] == 'JOSEPH DRISCOLL'] # df_Salaire[df_Salaire.EmployeeName == 'JOSEPH DRISCOLL']
Out[30]:
Id EmployeeName JobTitle BasePay OvertimePay OtherPay Benefits TotalPay TotalPayBenefits Year Notes Agency Status
24 25 JOSEPH DRISCOLL CAPTAIN, FIRE SUPPRESSION 140546.86 97868.77 31909.28 NaN 270324.91 270324.91 2011 NaN San Francisco NaN
In [31]:
df_Salaire[df_Salaire['EmployeeName'] == 'JOSEPH DRISCOLL']['JobTitle']
Out[31]:
24    CAPTAIN, FIRE SUPPRESSION
Name: JobTitle, dtype: object

How much does JOSEPH DRISCOLL make (including benefits)?

In [32]:
df_Salaire[df_Salaire['EmployeeName'] == 'JOSEPH DRISCOLL']['TotalPayBenefits']
Out[32]:
24    270324.91
Name: TotalPayBenefits, dtype: float64

What is the name of highest paid person (including benefits)?

In [33]:
df_Salaire.TotalPayBenefits.max()
Out[33]:
567595.43
In [34]:
df_Salaire.TotalPayBenefits == df_Salaire.TotalPayBenefits.max()
Out[34]:
0          True
1         False
2         False
3         False
4         False
          ...  
148649    False
148650    False
148651    False
148652    False
148653    False
Name: TotalPayBenefits, Length: 148654, dtype: bool
In [39]:
df_Salaire[df_Salaire['TotalPayBenefits'] == df_Salaire['TotalPayBenefits'].max()]
Out[39]:
Id EmployeeName JobTitle BasePay OvertimePay OtherPay Benefits TotalPay TotalPayBenefits Year Notes Agency Status
0 1 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 167411.18 0.0 400184.25 NaN 567595.43 567595.43 2011 NaN San Francisco NaN
In [41]:
df_Salaire[df_Salaire['TotalPayBenefits'] == df_Salaire['TotalPayBenefits'].max()]['EmployeeName']
Out[41]:
0    NATHANIEL FORD
Name: EmployeeName, dtype: object

On peut également utiliser cette forme¶

In [42]:
df_Salaire.loc[df_Salaire['TotalPayBenefits'].idxmax()]
Out[42]:
Id                                                               1
EmployeeName                                        NATHANIEL FORD
JobTitle            GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY
BasePay                                                     167411
OvertimePay                                                      0
OtherPay                                                    400184
Benefits                                                       NaN
TotalPay                                                    567595
TotalPayBenefits                                            567595
Year                                                          2011
Notes                                                          NaN
Agency                                               San Francisco
Status                                                         NaN
Name: 0, dtype: object
In [44]:
df_Salaire.iloc[df_Salaire['TotalPayBenefits'].argmax()]
Out[44]:
Id                                                               1
EmployeeName                                        NATHANIEL FORD
JobTitle            GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY
BasePay                                                     167411
OvertimePay                                                      0
OtherPay                                                    400184
Benefits                                                       NaN
TotalPay                                                    567595
TotalPayBenefits                                            567595
Year                                                          2011
Notes                                                          NaN
Agency                                               San Francisco
Status                                                         NaN
Name: 0, dtype: object

What is the name of lowest paid person (including benefits)? Do you notice something strange about how much he or she is paid?

In [46]:
df_Salaire['TotalPayBenefits'].argmin()
Out[46]:
148653
In [48]:
df_Salaire.iloc[df_Salaire['TotalPayBenefits'].argmin()]
Out[48]:
Id                                      148654
EmployeeName                         Joe Lopez
JobTitle            Counselor, Log Cabin Ranch
BasePay                                      0
OvertimePay                                  0
OtherPay                               -618.13
Benefits                                     0
TotalPay                               -618.13
TotalPayBenefits                       -618.13
Year                                      2014
Notes                                      NaN
Agency                           San Francisco
Status                                     NaN
Name: 148653, dtype: object

On peut également utiliser la méthode ci-dessous;¶

In [49]:
df_Salaire[df_Salaire['TotalPayBenefits'] == df_Salaire['TotalPayBenefits'].min()]
Out[49]:
Id EmployeeName JobTitle BasePay OvertimePay OtherPay Benefits TotalPay TotalPayBenefits Year Notes Agency Status
148653 148654 Joe Lopez Counselor, Log Cabin Ranch 0.0 0.0 -618.13 0.0 -618.13 -618.13 2014 NaN San Francisco NaN

What was the average (mean) BasePay of all employees per year? (2011-2014) ?

In [51]:
df_Salaire.groupby('Year').mean()
Out[51]:
Id BasePay OvertimePay OtherPay Benefits TotalPay TotalPayBenefits Notes Status
Year
2011 18080.0 63595.956517 4531.065429 3617.081926 NaN 71744.103871 71744.103871 NaN NaN
2012 54542.5 65436.406857 5023.417824 3653.437583 26439.966967 74113.262265 100553.229232 NaN NaN
2013 91728.5 69630.030216 5281.641980 3819.969007 23829.076572 77611.443142 101440.519714 NaN NaN
2014 129593.0 66564.421924 5401.993737 3505.421251 24789.601756 75463.918140 100250.918884 NaN NaN
In [54]:
df_Salaire.groupby('Year').mean()['BasePay']
Out[54]:
Year
2011    63595.956517
2012    65436.406857
2013    69630.030216
2014    66564.421924
Name: BasePay, dtype: float64
In [56]:
df_Salaire.groupby('Year').mean()['BasePay']
Out[56]:
Year
2011    63595.956517
2012    65436.406857
2013    69630.030216
2014    66564.421924
Name: BasePay, dtype: float64

How many unique job titles are there?

In [58]:
df_Salaire.JobTitle.nunique()
Out[58]:
2159

What are the top 5 most common jobs?

In [7]:
df_Salaire.JobTitle.value_counts(5)
Out[7]:
Transit Operator                   0.047331
Special Nurse                      0.029525
Registered Nurse                   0.025132
Public Svc Aide-Public Works       0.016939
Police Officer 3                   0.016286
                                     ...   
MEDIA PRODUCTION SUPERVISOR        0.000007
PRINCIPAL RECREATION SUPERVISOR    0.000007
SENIOR GENERAL UTILITY MECHANIC    0.000007
Chief Forensic Toxicologist        0.000007
SENIOR FOOD SERVICE SUPERVISOR     0.000007
Name: JobTitle, Length: 2159, dtype: float64

How many Job Titles were represented by only one person in 2013? (e.g. Job Titles with only one occurence in 2013?)

Combien de titres d'emploi étaient représentés par une seule personne en 2013 ? (par exemple, les titres de postes n'ayant qu'une seule occurrence en 2013)

In [18]:
df_Salaire[df_Salaire['Year']==2013]['JobTitle'].value_counts()==1
Out[18]:
Transit Operator                  False
Special Nurse                     False
Registered Nurse                  False
Public Svc Aide-Public Works      False
Custodian                         False
                                  ...  
IS Operator-Journey                True
Chief Adult Probation Officer      True
Asphalt Plant Supervisor 1         True
Laboratory Technician I            True
Asst Chf, Bur Clm Invest&Admin     True
Name: JobTitle, Length: 1051, dtype: bool
In [19]:
sum(df_Salaire[df_Salaire['Year']==2013]['JobTitle'].value_counts()==1)
Out[19]:
202

How many people have the word Chief in their job title? (This is pretty tricky)

In [25]:
def chief_string(title):
    if 'chief' in title.lower().split():
        return True
    else:
        return False
In [26]:
chief_string('GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY')
Out[26]:
False
In [27]:
chief_string('chief MANAGER-METROPOLITAN TRANSIT AUTHORITY')
Out[27]:
True
In [30]:
df_Salaire.JobTitle.iloc[10]
Out[30]:
'ASSISTANT CHIEF OF DEPARTMENT, (FIRE DEPARTMENT)'
In [ ]:
 
In [33]:
df_Salaire.JobTitle.apply(lambda x: chief_string(x))
Out[33]:
0         False
1         False
2         False
3         False
4          True
          ...  
148649    False
148650    False
148651    False
148652    False
148653    False
Name: JobTitle, Length: 148654, dtype: bool
In [34]:
sum(df_Salaire.JobTitle.apply(lambda x: chief_string(x)))
Out[34]:
477
In [41]:
df_Salaire.JobTitle
Out[41]:
0         GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY
1                        CAPTAIN III (POLICE DEPARTMENT)
2                        CAPTAIN III (POLICE DEPARTMENT)
3                   WIRE ROPE CABLE MAINTENANCE MECHANIC
4           DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)
                               ...                      
148649                                         Custodian
148650                                      Not provided
148651                                      Not provided
148652                                      Not provided
148653                        Counselor, Log Cabin Ranch
Name: JobTitle, Length: 148654, dtype: object
In [21]:
 
Out[21]:
477

Bonus: Is there a correlation between length of the Job Title string and Salary? ¶

Bonus : Y a-t-il une corrélation entre la longueur de la chaîne d'intitulé du poste et le salaire ?

In [47]:
df_Salaire['title_len'] = df_Salaire['JobTitle'].apply(len)
In [51]:
df_Salaire[['JobTitle', 'title_len']].corr()
Out[51]:
title_len
title_len 1.0

Great Job!¶